In [ ]:
import pandas as pd
In [ ]:
def make_data(n_rows=None):
    columns = ['room_id', 'host_id', 'room_type', 'neighborhood', 'reviews', 'overall_satisfaction',
               'accommodates', 'bedrooms', 'price']
    df = pd.read_csv('../data/airbnb_input.csv', usecols=columns, index_col='room_id').sort_index()
    if n_rows:
        df = df.head(n_rows)
    return df

def make_location_data():
    columns = ['room_id', 'latitude', 'longitude']
    df = pd.read_csv('../data/airbnb_input.csv', usecols=columns, index_col='room_id').sort_index()
    return df
df = make_data(5)
df is the same DataFrame that we used in Unit 2. You can get it by calling the function make_data.
In order to drop rows and columns from a DataFrame, you can use the drop function.
drop has two important arguments: inplace, which controls whether the original DataFrame is modified in place or a new DataFrame is returned, and axis, which selects whether rows (axis=0, the default) or columns (axis=1) are dropped.
Examples...
Dropping the row with index 6499 and checking that the original DataFrame df didn't change:
In [ ]:
df1 = df.drop(6499)
print(6499 in df.index)
df1
With inplace=True, on the other hand, we do change the original DataFrame:
In [ ]:
df.drop(6499, inplace=True)
print(6499 in df.index)
df
Now, go ahead and try to drop a list of rows!
In [ ]:
# code to drop a list of rows
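# one possible approach (just a sketch, not the only solution): pass a list of index
# labels to drop; df.index[:2] is used so that no specific room_ids are assumed to exist
df = make_data(5)
df = df.drop(list(df.index[:2]))
df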
In order to drop columns, we use the axis=1 argument:
In [ ]:
df = df.drop(['reviews', 'price'], axis=1)
df
Sometimes it's useful to keep an unchanged version of your DataFrame.
When doing that, be careful not to make this mistake:
In [ ]:
df = make_data(n_rows=5)
new_df = df # new_df is just another way to refer to the same DataFrame as df
df.drop(['room_type', 'neighborhood', 'reviews'], axis=1, inplace=True)
new_df
In this example, we are just creating a new name that refers to the original DataFrame. So the transformations that you apply to df and the ones that you apply to new_df will affect the same DataFrame.
If you want to make a copy of a DataFrame, you should use the copy method:
In [ ]:
df = make_data(n_rows=5)
new_df = df.copy() # new_df now refers to a copy of df, we actually have two DataFrames
df.drop(['room_type', 'neighborhood', 'reviews'], axis=1, inplace=True)
new_df
Multiplying a DataFrame column by a constant
Getting the rooms' price per week (7 nights):
In [ ]:
df = make_data(5)
df['price_per_week'] = df.price.multiply(7) # or df['price_per_week'] = df.price * 7
df.head()
Dividing a DataFrame column by another column
Getting the people per bedroom ratio:
In [ ]:
df = make_data(5)
df['people_per_bedroom'] = df.accommodates.divide(df.bedrooms)
# or df['people_per_bedroom'] = df.accommodates / df.bedrooms
df.head()
In Pandas, Series and Indexes have a set of string processing methods that can be accessed through the str attribute.
For instance, if we want to make the neighborhood names more machine friendly, we can convert them to lower case and replace spaces with underscores:
In [ ]:
df.neighborhood = df.neighborhood.str.replace(' ', '_').str.lower()
df.head()
We may also want to remove certain characters from the strings in a column:
In [ ]:
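# note: str.strip('/apt') removes any of the characters '/', 'a', 'p' and 't' from both ends
# of each string; it does not remove the substring '/apt' as a whole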
df.room_type = df.room_type.str.strip('/apt')
df.head()
To give an example using the str attribute on Indexes, we'll convert the room_id index into a string and then append the host_id to it, separated by an underscore:
In [ ]:
df = make_data(5)
df = df.reset_index()
df.room_id = df.room_id.astype(str)
df = df.set_index('room_id')
df.index = df.index.str.cat(df.host_id.astype(str), sep='_')
df
In Pandas, group by refers to a process of three chained steps: split-apply-combine.
For instance, if we want to know how many rooms each landlord has:
In [ ]:
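# groupby by itself only returns a lazy DataFrameGroupBy object; an aggregation
# such as count() is still needed to actually get the room counts per landlord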
df.groupby('host_id')
In [ ]:
df = make_data()
# reset df index so that we have the room_id as a column
df = df.reset_index()
# only selecting columns room_id and host_id from the DataFrame
df = df[['room_id', 'host_id']]
# groupby() returns a DataFrameGroupBy object and count()
# is the aggregate function that counts room_ids in each group
df = df.groupby('host_id').count()
# now we don't have room_id anymore, we have room counts instead
# so the column name should be changed
df = df.rename(columns={'room_id': 'room_counts'})
df.head()
By default, the columns on which we perform the group by become the index. If we don't want this behaviour, we can use the argument as_index=False.
In [ ]:
df = make_data()
df = df.reset_index()
df = df[['room_id', 'host_id']]
df = df.groupby('host_id', as_index=False).count()
df = df.rename(columns={'room_id': 'room_counts'})
df.head(10).tail(5)
We can use group by to compute the average price per room type for each landlord.
In [ ]:
df = make_data()
df = df[['host_id', 'room_type', 'price']]
df = df.groupby(['host_id', 'room_type']).mean()
df.head(10).tail(5)
Now a small challenge! Check the split-apply-combine documentation and try to find examples, with our dataset, for:
In [ ]:
# write the code for the challenge here
If you have the time, after the practical exercises, explore the merge and concat functions.
In particular, try to solve these exercises:
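As a starting point, here is a minimal sketch of what merge and concat look like with this dataset (the names rooms, locations, rooms_with_location and stacked are just illustrative, and we assume the CSV reads in make_data and make_location_data work as above):
In [ ]:
rooms = make_data(5)
locations = make_location_data()
# merge: join the room data with the coordinates, matching rows on the shared room_id index
rooms_with_location = rooms.merge(locations, left_index=True, right_index=True)
# concat: stack two row selections of the same DataFrame on top of each other
stacked = pd.concat([rooms.head(2), rooms.tail(2)])
rooms_with_location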